DETECTION OF SOUND ACTIVITY 
[01] This application claims the benefit of U.S. Provisional Patent No. 60/251,749 filed on 
December 4, 2000. 

BACKGROUND OF THE INVENTION 
[02] This invention relates in general to systems for transmission of speech and, more 
specifically, to detecting speech activity in a transmission. 

[03] The purpose of some speech activity detection algorithms, or VAD algorithms, for 
transmission systems is to detect periods of speech inactivity during a transmission. During 
these periods a substantially lower transmission rate can be utilized without quality reduction 
to obtain a lower overall transmission rate. A key issue in the detection of speech activity is 
to utilize speech features that show distinctive behavior between the speech activity and 
noise. A number of different features have been proposed in prior art. 

Time domain measures 
[04] In a low background noise environment, the signal level difference between active 
and inactive speech is significant. One approach is therefore to use the short-term energy and 
track energy variations in the signal. If energy increases rapidly, that may correspond to 
the appearance of voice activity; however, it may also correspond to a change in background 
noise. Thus, although that method is very simple to implement, it is not very reliable in 
relatively noisy environments, such as in a motor vehicle, for example. Various adaptation 
techniques, and complementing the level indicator with other time-domain measures, e.g. 
the zero crossing rate and envelope slope, may improve the performance in higher noise 
environments. 
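
As an illustration of the time-domain measures described above, the following minimal Python sketch computes a short-term energy and a zero crossing rate per frame and applies a simple adaptive energy threshold. The decision rule, the function names and the constants `ratio` and `alpha` are assumptions chosen for illustration; they are not taken from any particular prior-art detector.

```python
def short_term_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def energy_vad(frames, ratio=4.0, alpha=0.9):
    """Flag a frame as active when its energy exceeds `ratio` times a
    running noise-floor estimate (hypothetical rule and constants)."""
    noise = short_term_energy(frames[0])  # assumption: the first frame is noise
    decisions = []
    for frame in frames:
        energy = short_term_energy(frame)
        active = energy > ratio * noise
        if not active:
            # Track the noise floor only while the frame is judged inactive.
            noise = alpha * noise + (1 - alpha) * energy
        decisions.append(active)
    return decisions
```

A detector of this kind reacts to any rapid energy increase, which is why it tends to misfire when the background noise level itself changes.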

Spectrum measures 

[05] In many environments, the main noise sources occur in defined areas of the frequency 
spectrum. For example, in a moving car most of the noise is concentrated in the low 
frequency regions of the spectrum. Where such knowledge of the spectral position of noise is 
available, it is desirable to base the decision as to whether speech is present or absent upon 
measurements taken from that portion of the spectrum containing relatively little noise. 
[06] Numerous techniques based on spectral cues are known. Some 
techniques implement a Fourier transform of the audio signal to measure the spectral distance 
between it and an averaged noise signal that is updated in the absence of any voice activity. 
Other methods use sub-band analysis of the signal, which is close to the Fourier methods. 
The same applies to methods that make use of cepstrum analysis. 
[07] The time-domain measure of zero-crossing rate is a simple spectral cue that 
essentially measures the relation between high and low frequency contents in the spectrum. 
Techniques are also known to take advantage of periodic aspects of speech. All voiced 
sounds have determined periodicity, whereas noise is usually aperiodic. For this purpose, 
autocorrelation coefficients of the audio signal are generally computed in order to determine 
the second maximum of such coefficients, where the first maximum represents energy. 

[08] Some voice activity detection (VAD) algorithms are designed for specific speech 
coding applications and have access to speech coding parameters from those applications. An 
example is the G.729 application, which employs four different measurements on the speech 
segment to be classified. The measured parameters are the zero-crossing rate, the full band 
speech energy, the low band speech energy, and 10 line spectral frequencies from a linear 
prediction analysis. 

Problems with conventional solutions 

[09] Most VAD features are good at separating voiced speech from unvoiced speech. 
Therefore the classification scenario is to distinguish between three classes, namely, voiced 
speech, unvoiced speech, and inactivity. When the background noise becomes loud it can be 
difficult to distinguish between active unvoiced speech and inactive background noise. 
Virtually all VAD algorithms have problems with the situation where a single person is 
talking over background noise that consists of other people talking (often referred to as 
babble noise) or an interfering talker. 

Likelihood ratio detection 
[10] A classic detection problem is to determine whether a received entity belongs to one 
of two signal classes. Two hypotheses are then possible. Let the received entity be denoted 
$r$; then the hypotheses can be expressed: 

$$H_0: r \in S_0$$
$$H_1: r \in S_1$$

where $S_0$ and $S_1$ are the signal classes. A Bayes decision rule, also called a likelihood ratio 
test, is used to form a ratio between probabilities that the hypotheses are true given the 
received entity $r$. A decision is made according to a threshold $\tau$: 

$$\frac{\Pr(r \mid H_1)}{\Pr(r \mid H_0)} \; \begin{cases} \geq \tau, & \text{choose } H_1 \\ < \tau, & \text{choose } H_0 \end{cases}$$

The threshold $\tau$ is determined by the a priori probabilities of the hypotheses and costs for 
the four classification outcomes. If we have uniform costs and equal prior probabilities then 
$\tau = 1$ and the detection is called a maximum likelihood detection. A common variant used 
for numerical convenience is to use logarithms of the probabilities. If the probability density 
functions for the hypotheses are known, the log likelihood ratio test becomes: 

$$\log\left(\frac{\Pr(r \mid H_1)}{\Pr(r \mid H_0)}\right) = \log\left(\frac{f_{H_1}(r)}{f_{H_0}(r)}\right) \; \begin{cases} \geq \tau, & \text{choose } H_1 \\ < \tau, & \text{choose } H_0 \end{cases}$$
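
A log likelihood ratio test of this kind can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that both hypotheses have known one-dimensional Gaussian densities; the function names and the `(mean, variance)` parameterization are hypothetical. Note that the threshold `tau` here is applied in the log domain, so `tau = 0.0` corresponds to a ratio threshold of 1.

```python
import math


def gaussian_pdf(x, mean, var):
    """Value of a 1-D Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def log_likelihood_ratio(r, h1, h0):
    """log f_H1(r) - log f_H0(r); h1 and h0 are (mean, variance) pairs."""
    return math.log(gaussian_pdf(r, *h1)) - math.log(gaussian_pdf(r, *h0))


def decide(r, h1, h0, tau=0.0):
    """Return 1 to choose H1, 0 to choose H0 (log-domain threshold tau)."""
    return 1 if log_likelihood_ratio(r, h1, h0) >= tau else 0
```

For two unit-variance Gaussians with means 0 and 2, the log ratio is linear in `r` and changes sign at `r = 1`, so lowering `tau` below zero biases the decision towards `H1`.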

Gaussian mixture modeling 

[11] Likelihood ratio detection is based on knowledge of parameter distributions. The 
density functions are mostly unknown for real world signals, but can be assumed to be of a 
simple, e.g. Gaussian, distribution. More complex distributions can be estimated with more 
general probability density function (PDF) models. In speech processing, Gaussian mixture 
(GM) models have been successfully employed in speech recognition and in speaker 
identification. 

[12] A Gaussian mixture PDF for $d$-dimensional random vectors, $x$, is a weighted sum of 
densities: 

$$f_x(x) = \sum_{k=1}^{M} \rho_k f_{x,k}(x)$$

where $\rho_k$ are the component weights, and the component densities $f_{x,k}(x)$ are Gaussian 
with mean vectors $\mu_k$ and covariance matrices $C_k$. The component weights are constrained 
by $\rho_k > 0$ and $\sum_{k=1}^{M} \rho_k = 1$. 
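
To make the weighted sum concrete, the following Python sketch evaluates a Gaussian mixture PDF in the one-dimensional case ($d = 1$), where each covariance matrix reduces to a scalar variance. The function name and the list-based parameter layout are assumptions made for this example.

```python
import math


def gmm_pdf(x, weights, means, variances):
    """Evaluate a 1-D Gaussian mixture density at x.

    The weights must be positive and sum to one, as required of
    the component weights of a mixture.
    """
    if any(w <= 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to one")
    return sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )
```

With a single component of weight one, the function reduces to an ordinary Gaussian density.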

Adaptive algorithms 

[13] The GM parameters are often estimated using an iterative algorithm known as an 
expectation-maximization (EM) algorithm. In classification applications, such as speaker 
recognition, fixed PDF models are often estimated by applying the EM algorithm on a large 
set of training data offline. The results are then used as fixed classifiers in the application. 
This approach can be used successfully if the application conditions (recording equipment, 
background noise, etc.) are similar to the training conditions. In an environment where the 
conditions change over time, however, a better approach utilizes adaptive techniques. A 
common adaptive strategy in signal processing is the gradient method, in which parameters 
are updated so that a distortion criterion is decreased. This is achieved by adding small 
values to the parameters in the negative direction of the first derivative of the distortion 
criterion with respect to the parameters. 

BRIEF DESCRIPTION OF THE DRAWINGS 
[14] The present invention is described in conjunction with the appended figures: 
[15] FIG. 1 presents an overview block diagram of an embodiment of a transmitting part of 
a speech transmitter system; 
[16] FIG. 2A presents an overview block diagram of a first embodiment of a VAD 
algorithm system; 
[17] FIG. 2B presents an overview block diagram of a second embodiment of a VAD 
algorithm system; 
[18] FIG. 3 presents an overview block diagram of an embodiment of a feature extraction 
unit; 
[19] FIG. 4A presents an overview block diagram of the first embodiment of a 
classification unit; 
[20] FIG. 4B presents an overview block diagram of the second embodiment of a 
classification unit; 
[21] FIG. 5 presents a flow diagram of an embodiment of a hangover algorithm; and 
[22] FIG. 6 presents an overview block diagram of an embodiment of a model update unit. 
[23] In the appended figures, similar components and/or features may have the same 
reference label. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 
[24] The ensuing description provides preferred exemplary embodiment(s) only, and is not 
intended to limit the scope, applicability or configuration of the invention. Rather, the 
ensuing description of the preferred exemplary embodiment(s) will provide those skilled in 
the art with an enabling description for implementing a preferred exemplary embodiment of 
the invention. It being understood that various changes may be made in the function and 
arrangement of elements without departing from the spirit and scope of the invention as set 
forth in the appended claims. 
[25] An ideal speech detector is highly sensitive to the presence of speech signals while at 
the same time remaining insensitive to non-speech signals, which typically include various 
types of environmental background noise. The difficulty arises in quickly and accurately 
distinguishing between speech and certain types of noise signals. As a result, voice activity 
detection (VAD) implementations have to deal with the trade-off situation between speech 
clipping, which is speech misinterpreted as inactivity, on one hand and excessive system 
activity due to noise sensitivity on the other hand. 

[26] Standard procedures for VAD try to estimate one or more feature tracks, e.g. the 
speech power level or periodicity. This gives only a one-dimensional parameter for each 
feature and this is then used for a threshold decision. Instead of estimating only the current 
feature itself, the present invention dynamically estimates and adapts the probability density 
function (PDF) of the feature. By this approach more information is gathered, in terms of 
degrees of freedom for each feature, to base the final VAD decision upon. 
[27] In one embodiment, the classification is based on statistical modeling of the speech 
features and likelihood ratio detection. A feature is derived from any tangible characteristic 
of a digitally sampled signal such as the total power, power in a spectral band, etc. The 
second part of this embodiment is the continuous adaptation of models, which is used to 
obtain robust detection in varying background environments. 

[28] The present invention provides a speech activity detection method intended for use in 
the transmitting part of a speech transmission system. One embodiment of the invention 
includes four steps. The first step of the method consists of a speech feature extraction. The 
second step of the method consists of log-likelihood ratio tests, based on an estimated 
statistical model, to obtain an activity decision. The third step of the method consists of a 
smoothing of the activity decision for hangover periods. The fourth step of the method 
consists of adaptation of the statistical models. 

[29] Referring first to FIG. 1, a block diagram for the transmitting part of a speech 
transmitter system 100 is shown. The sound is picked up by a microphone 110 to produce an 
electric signal 120, which is sampled and quantized into digital format by an A/D converter 
130. The sample rate of the sound signal is chosen to be adequate for the bandwidth of the 
signal and can typically be 8 kHz or 16 kHz for speech signals and 32 kHz, 44.1 kHz or 
48 kHz for other audio signals such as music, but other sample rates may be used in other 
embodiments. The sampled signal 140 is input to a VAD algorithm 150. The output 160 of 
the VAD algorithm 150 and the sampled signal 140 are input to the speech encoder 170. The 
speech encoder 170 produces a stream of bits 180 that are transmitted over a digital channel. 
VAD procedure 

[30] The VAD approach taken by the VAD algorithm 150 in this embodiment is based on 
a priori knowledge of PDFs of specific speech features in the two cases where speech is 
active or inactive. The observed signal, $u(t)$, is expressed as a sum of a non-speech signal, 
$n(t)$, and a speech signal, $s(t)$, which is modulated by a switching function, $\theta(t)$: 

$$u(t) = \theta(t)\,s(t) + n(t), \qquad \theta(t) \in \{0, 1\}$$

[31] The signals contain feature parameters, $x_s$ and $x_n$, and the observed signal can be 
written as: 

$$u(t, x(t)) = \theta(t)\,s(t, x_s(t)) + n(t, x_n(t))$$

[32] It is assumed that the feature parameters can be extracted from the observed signal by 
some extraction procedure. For every time instant, $t$, the probability density function for the 
feature can be expressed as: 

$$f_x(x) = f_{x|\theta=0}(x \mid \theta = 0)\Pr(\theta = 0) + f_{x|\theta=1}(x \mid \theta = 1)\Pr(\theta = 1)$$

[33] With access to the speech and non-speech conditional PDFs, we can regard the 
problem as a likelihood ratio detection problem: 

$$\log\left(\frac{f_{x|\theta=1}(x_0)}{f_{x|\theta=0}(x_0)}\right) \; \begin{cases} \geq \tau, & \text{choose } \theta = 1 \\ < \tau, & \text{choose } \theta = 0 \end{cases}$$

where $x_0$ is the observed feature and $\tau$ is the threshold. The higher the ratio, generally, the 
more likely the observed feature corresponds to speech being present in the sampled signal. 
It is possible to adjust the decision to avoid false classification of speech as inactivity by 
letting $\tau < 0$. The threshold can also be determined by the a priori probabilities of the two 
classes, if these probabilities are assumed to be known. The PDFs for speech and non-speech 
are estimated offline in a training phase for this embodiment. 

[34] With reference to FIGS. 2A and 2B, embodiments of VAD algorithm systems 150 are 
shown. The embodiment of FIG. 2A includes a model update unit 260 to adapt the models to 
various signal conditions over time to increase likelihood. In contrast, the embodiment of 
FIG. 2B does not adapt over time. The VAD algorithm system 150 consists of four major 
parts, namely, a feature extraction unit 210, a classification unit 230, a hangover smoothing 
function 250, and a model update function 260. The VAD algorithm system 150 generally 
operates according to the following four steps. First, a set of speech features is extracted by 
the feature extraction unit 210. Second, features 220 produced by the feature extraction 
unit 210 are used as arguments in the classification unit 230. Third, an initial decision 
240 that is produced by the classification unit 230 is smoothed by the hangover 
smoothing function 250. Fourth, the statistical models in the model update function 260 are 
updated based on the current features such that the models are iteratively improved over time. 
Each of these four steps is described in further detail below. 



Feature Extraction 

[35] An embodiment of the feature extraction unit 210 is depicted in FIG. 3. The sampled 
speech signal 140 is divided into frames 315 of $N_{fr}$ samples by the framing unit 320. If the 
frame power 330, as determined by a power calculation unit 325, is below a certain threshold, 
a binary decision variable 215, $V_P$, is set to zero by a threshold tester 315 for later use in 
the classification. In this embodiment, an $N_{FFT}$ ($N_{FFT} > N_{fr}$) samples-long discrete fast Fourier 
transform (FFT) 350 operates upon a zero-padded and windowed frame produced by the 
padding and windowing unit 345. The signal powers in bands, $x_j$, (the "$N$ powers") 220 
are calculated by adding the logarithms of the absolute values of the Fourier coefficients in 
each band and normalizing them with the length of the band with the squared absolute values 
block 220 and the partial sums block 370. These $N$ powers 220 are the features used in the 
classification. 

Likelihood Ratio Tests 

[36] Two embodiments of the classification unit 230 are shown in FIGS. 4A and 4B. The 
embodiment of FIG. 4A interfaces with the embodiment of the VAD algorithm system 150 of 
FIG. 2A and includes adaptive inputs 270. The embodiment of FIG. 4B interfaces with the 
embodiment of the VAD algorithm system 150 of FIG. 2B and does not have an adaptive 
feature. In these embodiments, the $N$ powers 220 or features 220, $x_j$, are used in $N_C$ 
parallel $N_d$-dimensional likelihood ratio generators 420, where $N_C \leq N$. A likelihood 
ratio 430, $\eta_m$, is calculated with the likelihood ratio generators 420 by taking the logarithm of 
a ratio between the activity PDF value and the inactivity PDF value obtained by using the 
feature as arguments to the PDFs: 

$$\eta_m = \log\left(\frac{f_m^{(S)}(x_m)}{f_m^{(N)}(x_m)}\right)$$
where $f_m^{(S)}$ denotes the activity PDF, $f_m^{(N)}$ denotes the inactivity PDF, and $x_m$ are $N_d$- 
dimensional vectors formed by grouping the features $x_j$. A weight calculation unit 425 
determines a weighting factor 440, $w_m$, for each likelihood ratio 430. A test variable 460, $y$, 
is then calculated as a weighted sum of the ratios: 

$$y = \sum_{m=1}^{N_C} w_m \eta_m$$

Experimentation may be used to determine the best weighting for each likelihood ratio 430. 
In one embodiment, each likelihood ratio 430 is equally weighted. 

[37] The test variable 460 is compared to a certain threshold, $\tau_y$, by a first decision block 
465 to obtain a decision variable 470, $V_y$: 

$$V_y = \begin{cases} 1, & y > \tau_y \\ 0, & \text{otherwise} \end{cases}$$

If an individual channel indicates strong activity by having a large likelihood ratio 430, $\eta_m$, 
greater than another threshold, $\tau_0$, then a corresponding variable 450, $V_m$, is set to equal one 
in a second decision block 445. The initial activity classification 240, $V_I$, is calculated as the 
logical OR of the corresponding and decision variables 450, 470. 
[38] This embodiment of the invention utilizes Gaussian mixture models for the PDF 
models, but the invention is not to be so limited. In the following description of this 
embodiment, $N_d = 1$ and $N_C = N$ will be used to imply one-dimensional Gaussian mixture 
models. It is entirely in the spirit of the invention to employ a number of multivariate 
Gaussian mixture models. 
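
The feature extraction and likelihood ratio stages described above can be sketched together in Python for the one-dimensional case. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the Hann window, the naive DFT used in place of an FFT routine, the band layout, the small numerical floors, and all function and parameter names (`band_log_powers`, `classify`, `tau_y`, `tau_0`) are hypothetical choices.

```python
import cmath
import math


def band_log_powers(frame, n_fft, bands):
    """Window and zero-pad a frame, then average log |X[k]|^2 over each
    band, given as a (first_bin, last_bin_exclusive) pair."""
    n = len(frame)
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    padded = [s * w for s, w in zip(frame, hann)] + [0.0] * (n_fft - n)
    feats = []
    for lo, hi in bands:
        total = 0.0
        for k in range(lo, hi):
            # Naive DFT bin; a practical system would use an FFT routine.
            coef = sum(v * cmath.exp(-2j * math.pi * k * t / n_fft)
                       for t, v in enumerate(padded))
            total += math.log(abs(coef) ** 2 + 1e-12)  # floor avoids log(0)
        feats.append(total / (hi - lo))  # normalize by the band length
    return feats


def gmm_pdf(x, model):
    """1-D Gaussian mixture; model is a list of (weight, mean, variance)."""
    return sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
               for w, m, v in model)


def classify(feats, active, inactive, weights, tau_y, tau_0):
    """Return the initial decision and the per-band log likelihood ratios."""
    etas = [math.log((gmm_pdf(x, a) + 1e-300) / (gmm_pdf(x, i) + 1e-300))
            for x, a, i in zip(feats, active, inactive)]
    y = sum(w * e for w, e in zip(weights, etas))   # weighted test variable
    v_y = y > tau_y                                 # weighted-sum decision
    v_any = any(e > tau_0 for e in etas)            # one strongly active band
    return int(v_y or v_any), etas
```

The logical OR lets a single band with a very large ratio trigger activity even when the weighted sum stays below its threshold, which matches the role of the second decision block.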

Hangover Smoothing 
[39] With reference to FIG. 5, an embodiment of a hangover algorithm 250 is used to 
prevent clipping at the end of a talk spurt. The hangover time depends on the duration of 
the current activity. If the talk spurt is longer than a given number of frames, the hangover 
time, $n_O$, is fixed to $N_1$ frames; otherwise a lower fixed hangover time of $N_2$ frames is used, 
as shown in steps 508, 516 and 520. A logical AND between the output of the hangover 
smoothing, $V_H$, and the frame power binary variable 215, $V_P$, yields the final VAD decision 
160, $V_F$. If $V_I = 1$ then $V_H = 1$ in step 536 and a counter is incremented in step 532 to count 
the number of consecutive active frames. Otherwise, if $V_I$ became 0 within the last $N_1$ or 
$N_2$ frames, then $V_H = 1$, as shown in steps 512, 524 and 528. If $V_I$ has been 0 longer than 
$N_1$ or $N_2$ frames, then $V_H = 0$ in steps 512, 524 and 540. 
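
The hangover logic described above can be sketched as a small Python function. The parameter names `spurt_min`, `n1` and `n2`, their default values, and the two run counters are assumptions for illustration; the step numbers of FIG. 5 are not modeled individually.

```python
def hangover_smooth(v_i_seq, v_p_seq, spurt_min=10, n1=8, n2=3):
    """Smooth initial decisions V_I and gate them with the frame power
    flag V_P. Inputs are sequences of 0/1 values; returns a list of final
    0/1 decisions. `spurt_min`, `n1`, `n2` are hypothetical frame counts."""
    out = []
    active_run = 0   # consecutive active frames in the current talk spurt
    silent_run = 0   # frames elapsed since V_I was last 1
    hang = 0         # hangover length currently in force
    for v_i, v_p in zip(v_i_seq, v_p_seq):
        if v_i:
            active_run += 1
            silent_run = 0
            # Long talk spurts earn the longer hangover time.
            hang = n1 if active_run > spurt_min else n2
            v_h = 1
        else:
            active_run = 0
            silent_run += 1
            v_h = 1 if silent_run <= hang else 0
        out.append(v_h & v_p)  # logical AND with the frame power flag
    return out
```

For example, with `spurt_min=5`, `n1=3` and `n2=1`, a two-frame spurt is extended by one frame, while a six-frame spurt is extended by three.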

Model Update 

[40] The parameters of the active and the inactive PDF models are updated after every 
frame in the adaptive embodiment shown in FIG. 2A. Feature data is sampled over time by 
the model update unit 260 to affect operation in the classification unit 230 to increase 
likelihood. The stages of updates are performed by the model update unit 260 depicted in 
FIG. 6. Both the PDF models are first updated by a gradient method for a likelihood ascend 
adaptation using an inactivity likelihood ascend unit 610 and a speech likelihood ascend unit 
620. The inactive PDF model parameters are then adapted to reflect the background by a 
long-term correction 630. Finally, a test is performed to assure a minimum model separation 
640, where the active PDF model parameters may be further adapted. 
Likelihood Ascend 

[41] The PDF parameters are updated to increase the likelihood. The parameters are the 
logarithms of the component weights, $\varpi_{j,k}^{(N)}$ and $\varpi_{j,k}^{(S)}$, the component means, $\mu_{j,k}^{(N)}$ and $\mu_{j,k}^{(S)}$, 
and the variances, $\lambda_{j,k}^{(N)}$ and $\lambda_{j,k}^{(S)}$. For notation convenience the symbol $a \mathrel{+}= b$ will in the 
following denote $a(n+1) = a(n) + b(n)$, where $n$ is an iteration counter. For the update 
equations we calculate the following probabilities: 

$$f_j^{(N)}(x_j(n)) = \sum_k \rho_{j,k}^{(N)} f_{j,k}^{(N)}(x_j(n)), \qquad f_j^{(S)}(x_j(n)) = \sum_k \rho_{j,k}^{(S)} f_{j,k}^{(S)}(x_j(n))$$

[42] The logarithms of the component weights are updated by a gradient step whose size is 
set by a constant controlling the adaptation. The component weights are restricted 
not to fall below a minimum weight. They must also add to one and this is assured by 

$$\rho_{j,k}^{(N)} = \frac{\rho_{j,k}^{(N)}}{\sum_l \rho_{j,l}^{(N)}}, \qquad \rho_{j,k}^{(S)} = \frac{\rho_{j,k}^{(S)}}{\sum_l \rho_{j,l}^{(S)}}$$
[43] The variance parameters are updated as standard deviations. 

[44] The variance parameters, $\lambda_{j,k}$, are restricted not to fall below a minimum value. 
[45] The component means are updated similarly. 

[46] As with the component weights, the update equations for the means and the standard 
deviations also contain adaptation constants controlling the step sizes. 
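
As a rough illustration of a likelihood ascend step, the Python sketch below applies one plain gradient-ascent update of the log-likelihood of a single observation to a one-dimensional Gaussian mixture. It uses the textbook gradients with respect to the log-weights, means and standard deviations; it is an assumption that this resembles the update used in the embodiment, and the step size, the floors and all names are hypothetical.

```python
import math


def ascend_step(x, weights, means, stds, step=0.05, min_weight=0.01, min_std=0.1):
    """One gradient-ascent step on log f(x) for a 1-D Gaussian mixture.

    Returns updated (weights, means, stds). Assumes f(x) > 0, i.e. x is
    not so far from every component that the density underflows.
    """
    comps = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
             for w, m, s in zip(weights, means, stds)]
    total = sum(comps)
    post = [c / total for c in comps]  # also d log f / d log(w_k)

    # Update the weights in the log domain, floor them, then renormalize.
    new_w = [max(math.exp(math.log(w) + step * h), min_weight)
             for w, h in zip(weights, post)]
    norm = sum(new_w)
    new_w = [w / norm for w in new_w]

    # d log f / d mu_k = h_k * (x - mu_k) / sigma_k^2
    new_means = [m + step * h * (x - m) / s ** 2
                 for m, h, s in zip(means, post, stds)]

    # d log f / d sigma_k = h_k * ((x - mu_k)^2 - sigma_k^2) / sigma_k^3
    new_stds = [max(s + step * h * ((x - m) ** 2 - s ** 2) / s ** 3, min_std)
                for m, h, s in zip(means, post, stds)]
    return new_w, new_means, new_stds
```

Each component moves in proportion to its posterior share of the observation, so components that explain the current feature well adapt fastest.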
Long term correction 

[47] In a sufficiently long window there are most likely some inactive frames. The frame 
with the least power in this window is likely a non-speech frame. To obtain an estimate of 
the average background level in each band we take the average of the least $N_{sel}$ power values 
of the latest $N_{back}$ frames: 

$$\bar{x}_j = \frac{1}{N_{sel}} \sum_{i=1}^{N_{sel}} x_j^{(i)}$$

where $x_j^{(i)} \leq x_j^{(i+1)}$ are the sorted past feature (power) values 
$\{x_j(n), x_j(n-1), \ldots, x_j(n - N_{back})\}$. The mixture component means of the non-speech 
PDF are then adapted towards this value according to the equation: 

$$\mu_{j,k}^{(N)} \mathrel{+}= \varepsilon_{back}\left(\bar{x}_j - m_j^{(N)}\right)$$

where the GMM "global" mean is given by 

$$m_j^{(N)} = \sum_k \rho_{j,k}^{(N)} \mu_{j,k}^{(N)}$$

and the adaptation is controlled by the factor $\varepsilon_{back}$. 
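
The background estimate and the correction of the non-speech means can be sketched as follows. The sketch assumes that every component mean is shifted by the same fraction of the gap between the background estimate and the weighted (global) mean of the mixture; that update form, and the names `n_sel` and `eps`, are assumptions made for this example.

```python
def background_level(history, n_sel):
    """Average of the n_sel smallest values among recent band powers."""
    lowest = sorted(history)[:n_sel]
    return sum(lowest) / len(lowest)


def long_term_correction(means, weights, history, n_sel, eps):
    """Shift the non-speech component means towards the background level."""
    target = background_level(history, n_sel)
    global_mean = sum(w * m for w, m in zip(weights, means))
    shift = eps * (target - global_mean)
    return [m + shift for m in means]
```

Because the weights sum to one, shifting every mean by the same amount moves the global mean by exactly that amount, so repeated application drives it towards the background estimate.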
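Minimum model separation 
[48] In order to keep the speech and non-speech PDFs well separated the mixture 
component means of the active PDF are then adjusted according to the equations: 

$$\Delta_j = m_j^{(S)} - m_j^{(N)}$$
$$\Delta_j < \Delta_j^{\min}: \quad \mu_{j,k}^{(S)} \mathrel{+}= \left(\Delta_j^{\min} - \Delta_j\right) \cdot 0.95$$

where $m_j^{(S)} = \sum_k \rho_{j,k}^{(S)} \mu_{j,k}^{(S)}$ and $m_j^{(N)} = \sum_k \rho_{j,k}^{(N)} \mu_{j,k}^{(N)}$ are the global means of the 
active and inactive models, and $\Delta_j^{\min}$ is a pre-defined 
minimum distance. In one embodiment, an additional 5% separation is provided by applying 
the above technique. 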
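
A minimal Python sketch of such a separation test follows. It assumes that the separation is measured between the weighted (global) means of the two mixtures and that the active means are shifted by a fixed fraction of the shortfall; the function name, the `gain` parameter and that interpretation are assumptions for illustration.

```python
def enforce_separation(active_means, active_weights,
                       inactive_means, inactive_weights,
                       min_dist, gain=0.95):
    """Push the active-model means upward when the two global means
    are closer than min_dist; otherwise return them unchanged."""
    m_s = sum(w * m for w, m in zip(active_weights, active_means))
    m_n = sum(w * m for w, m in zip(inactive_weights, inactive_means))
    delta = m_s - m_n
    if delta < min_dist:
        shift = (min_dist - delta) * gain
        return [m + shift for m in active_means]
    return list(active_means)
```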

[49] While the principles of the invention have been described above in connection with 
specific apparatuses and methods, it is to be clearly understood that this description is made 
only by way of example and not as limitation on the scope of the invention. 